Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plan/executor: handle null for anti join properly #8706

Closed
wants to merge 9 commits into from

Conversation

eurekaka
Copy link
Contributor

@eurekaka eurekaka commented Dec 17, 2018

What problem does this PR solve?

Fix #8642

What is changed and how it works?

  • enhance joiner to be aware of whether the result of join condition is false or null, and act correspondingly for AntiSemiJoin and LeftOuterAntiSemiJoin;
  • differentiate not exists from not in / != all, since they have various behaviors regarding null input;
  • for not in (subq), there are 2 challenges:
    • executor of index join / hash join / merge join filters out null inner key before the real join operation;
    • we don't differentiate in condition and equal conditions pulled up later in decorrelation now, but they should have various behaviors regarding null input as well;
  • so if any column of not in condition is possible to be null, we cannot apply decorrelate rule for it, otherwise we would generate wrong results.

Check List

Tests

Code changes

  • Has interface methods change

Side effects

  • Possible performance regression: this may make some not in queries unable to decorrelate now, but correctness should be much more important than performance.

Related changes

  • Need to cherry-pick to the release branch

This change is Reviewable

@eurekaka eurekaka added type/bugfix This PR fixes a bug. sig/planner SIG: Planner sig/execution SIG execution labels Dec 17, 2018
@eurekaka eurekaka changed the title plan/executor: handle null for AntiSemiJoin properly plan/executor: handle null for anti join properly Dec 17, 2018
@eurekaka
Copy link
Contributor Author

/run-all-tests

@eurekaka
Copy link
Contributor Author

/run-all-tests tidb-test=pr/702

@eurekaka
Copy link
Contributor Author

@zz-jason @winoros @lamxTyler PTAL

@eurekaka
Copy link
Contributor Author

/run-all-tests tidb-test=pr/702

@eurekaka
Copy link
Contributor Author

/run-unit-test

@eurekaka
Copy link
Contributor Author

/run-all-tests tidb-test=pr/702

@eurekaka
Copy link
Contributor Author

@winoros comments addressed, the failed unit-test is caused by goroutine leak in ddl package.

Copy link
Member

@winoros winoros left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@eurekaka eurekaka added the status/LGT1 Indicates that a PR has LGTM 1. label Dec 18, 2018
@zhouqiang-cl
Copy link
Contributor

/run-unit-test

executor/joiner.go Outdated Show resolved Hide resolved
executor/joiner.go Show resolved Hide resolved
expression/expression.go Show resolved Hide resolved
expression/expression.go Outdated Show resolved Hide resolved
@eurekaka
Copy link
Contributor Author

@zz-jason comments addressed, PTAL

- enhance joiner to be aware of whether the result of join condition
  is false or null, and act correspondingly for AntiSemiJoin and
  LeftOuterAntiSemiJoin;
- differentiate `not exists` from `not in / != all`, since they have
  various behaviors regarding null input;
- for `not in (subq)`, there are 2 challenges:
  * executor of index join / hash join / merge join filters out null inner
    key before the real join operation;
  * we don't differentiate `in` condition and equal conditions pulled up
    later in decorrelation now, but they should have various behaviors either
    regarding null input;
  so if the inner column of `not in` condition is possible to be null,
  we cannot apply decorrelate rule for it, otherwise we would generate
  wrong results.
- rename isnull to hasNull
- add comments
@zz-jason
Copy link
Member

zz-jason commented Dec 26, 2018

After some investigation from the behavior of MySQL, the join result of LeftOuterSemiJoin and Anti LeftOuterSemiJoin should be one of the following according to different conditions:

  • OuterRow + NULL
  • OuterRow + 1
  • OuterRow + 0

Here the OuterRow is a record from the left hand side table.

1. LeftOuterSemiJoin

  1. If the outer join key contains NULL, the join result should be OuterRow + NULL, else:

  2. If there is at least one inner row has the same join key with the outer row, the join result should be OuterRow + 1, else:

  3. If there is at least one NULL in the inner join keys, the join result should be OuterRow + NULL, else:

  4. the join result should be OuterRow + 0

2. Anti LeftOuterSemiJoin

  1. If the outer join key contains NULL, the join result should be OuterRow + NULL, else:

  2. If there is at least one inner row has the same join key with the outer row, the join result should be OuterRow + 0, else:

  3. If there is at least one NULL in the inner join keys, the join result should be OuterRow + NULL, else:

  4. the join result should be OuterRow + 1

@eurekaka Am I right?

@zz-jason zz-jason closed this Dec 26, 2018
@zz-jason zz-jason reopened this Dec 26, 2018
@eurekaka
Copy link
Contributor Author

@zz-jason yes, you are right.

This is the output of MySQL:

mysql> select * from s;
+------+------+
| a    | b    |
+------+------+
|    1 | NULL |
|    2 |    1 |
+------+------+
2 rows in set (0.00 sec)

mysql> select *, a in (select b from s) from s;
+------+------+------------------------+
| a    | b    | a in (select b from s) |
+------+------+------------------------+
|    1 | NULL |                      1 |
|    2 |    1 |                   NULL |
+------+------+------------------------+
2 rows in set (0.00 sec)

mysql> select *, a not in (select b from s) from s;
+------+------+----------------------------+
| a    | b    | a not in (select b from s) |
+------+------+----------------------------+
|    1 | NULL |                          0 |
|    2 |    1 |                       NULL |
+------+------+----------------------------+

while this is the output after applying this PR:

mysql> select * from t;
+------+------+
| a    | b    |
+------+------+
|    1 | NULL |
|    2 |    1 |
+------+------+
2 rows in set (0.00 sec)

mysql> select *, a in (select b from t) from t;
+------+------+------------------------+
| a    | b    | a in (select b from t) |
+------+------+------------------------+
|    1 | NULL |                      1 |
|    2 |    1 |                      0 |
+------+------+------------------------+
2 rows in set (0.00 sec)

mysql> select *, a not in (select b from t) from t;
+------+------+----------------------------+
| a    | b    | a not in (select b from t) |
+------+------+----------------------------+
|    1 | NULL |                          0 |
|    2 |    1 |                          0 |
+------+------+----------------------------+

For LeftOuterAntiSemiJoin, it is easy to update the code to generate compatible result with MySQL now, but for LeftOuterSemiJoin, we have to disable de-correlation for LogicalApply with LeftOuterSemiJoin first, because for queries like select *, a in (select a from t t2 where t2.b = t1.b) from t t1, if we de-correlate it to a normal join, t1.a = t2.a and t1.b = t2.b cannot be differentiated to behave differently regarding null input.

@eurekaka
Copy link
Contributor Author

eurekaka commented Dec 28, 2018

@zz-jason AntiLeftOuterSemiJoin works now, PTAL. LeftOuterSemiJoin would be handled later in a separate PR.

mysql> select * from s;
+------+------+
| a    | b    |
+------+------+
|    1 | NULL |
|    2 |    1 |
+------+------+
2 rows in set (0.00 sec)

mysql> select *, a not in (select b from s) from s;
+------+------+----------------------------+
| a    | b    | a not in (select b from s) |
+------+------+----------------------------+
|    1 | NULL |                          0 |
|    2 |    1 |                       NULL |
+------+------+----------------------------+

@eurekaka
Copy link
Contributor Author

/rebuild

@zz-jason zz-jason self-requested a review January 6, 2019 07:31
}

for inner := inners.Current(); inner != inners.End(); inner = inners.Next() {
j.makeShallowJoinRow(j.outerIsRight, inner, outer)

matched, err = expression.EvalBool(j.ctx, j.conditions, j.shallowRow.ToRow())
matched, _, err = expression.EvalBool(j.ctx, j.conditions, j.shallowRow.ToRow())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also maintain the correctness of hasNull for semi join? If not, how about adding some comment about why we don't care whether there is any NULL value in the inner side and make the return value hasNull to false?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For semi join, false and null are same because in both cases the outer row would not be outputted. I will add some comments.

}

for inner := inners.Current(); inner != inners.End(); inner = inners.Next() {
j.makeShallowJoinRow(false, inner, outer)

matched, err = expression.EvalBool(j.ctx, j.conditions, j.shallowRow.ToRow())
matched, _, err = expression.EvalBool(j.ctx, j.conditions, j.shallowRow.ToRow())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For "left outer semi join", I think we should make the return value hasNull be correct. Because in this join method, what we really needed is the join result: matched or is NULL. For example, if we don't consider hasNull, we won't be able to produce the correct result for this query:

TiDB(root@127.0.0.1:test) > desc select a in (select b from t) from t;
+-------------------------+-------+------+----------------------------------------------------------------------------+
| id                      | count | task | operator info                                                              |
+-------------------------+-------+------+----------------------------------------------------------------------------+
| Projection_6            | 4.00  | root | 5_aux_0                                                                    |
| └─HashLeftJoin_7        | 4.00  | root | left outer semi join, inner:TableReader_11, equal:[eq(test.t.a, test.t.b)] |
|   ├─TableReader_9       | 4.00  | root | data:TableScan_8                                                           |
|   │ └─TableScan_8       | 4.00  | cop  | table:t, range:[-inf,+inf], keep order:false, stats:pseudo                 |
|   └─TableReader_11      | 4.00  | root | data:TableScan_10                                                          |
|     └─TableScan_10      | 4.00  | cop  | table:t, range:[-inf,+inf], keep order:false, stats:pseudo                 |
+-------------------------+-------+------+----------------------------------------------------------------------------+
6 rows in set (0.00 sec)

TiDB(root@127.0.0.1:test) > select a in (select b from t) from t;
+------------------------+
| a in (select b from t) |
+------------------------+
|                      1 |
|                      0 |
|                      0 |
|                      0 |
+------------------------+
4 rows in set (0.00 sec)

the table was generated by:

create table t(a bigint, b bigint, c bigint);
insert into t values(1, 1, 1), (null, 2, 2), (3, null, 3), (4, 400, 400);

the correct result is:

MySQL(root@localhost:test) > select a in (select b from t) from t;
+------------------------+
| a in (select b from t) |
+------------------------+
|                      1 |
|                   NULL |
|                   NULL |
|                   NULL |
+------------------------+
4 rows in set (0.00 sec)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done after we differentiate whether the join key is from in or not, for example, here is the output from MySQL:

mysql> select a in (select b from t) from t;
+------------------------+
| a in (select b from t) |
+------------------------+
|                      1 |
|                   NULL |
|                   NULL |
|                   NULL |
+------------------------+
4 rows in set (0.00 sec)

mysql> select exists(select * from t t2 where t2.b = t1.a) from t t1;
+----------------------------------------------+
| exists(select * from t t2 where t2.b = t1.a) |
+----------------------------------------------+
|                                            1 |
|                                            0 |
|                                            0 |
|                                            0 |
+----------------------------------------------+
4 rows in set (0.00 sec)

mysql> select a in (select a from t t2 where t2.b = t1.a) from t t1;
+---------------------------------------------+
| a in (select a from t t2 where t2.b = t1.a) |
+---------------------------------------------+
|                                           1 |
|                                           0 |
|                                           0 |
|                                           0 |
+---------------------------------------------+
4 rows in set (0.00 sec)

}

func (j *leftOuterSemiJoiner) onMatch(outer chunk.Row, chk *chunk.Chunk) {
chk.AppendPartialRow(0, outer)
chk.AppendInt64(outer.Len(), 1)
}

func (j *leftOuterSemiJoiner) onMissMatch(outer chunk.Row, chk *chunk.Chunk) {
func (j *leftOuterSemiJoiner) onMissMatch(hasNull bool, outer chunk.Row, chk *chunk.Chunk) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After maintaining the correct hasNull value, if hasNull is true, we should append a NULL value to make the correct join result.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@eurekaka
Copy link
Contributor Author

The issue is fixed by another new PR: #9051, so we can close this now.

@eurekaka eurekaka closed this Feb 25, 2019
@eurekaka eurekaka deleted the anti-join-apply branch February 25, 2019 08:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sig/execution SIG execution sig/planner SIG: Planner status/LGT1 Indicates that a PR has LGTM 1. type/bugfix This PR fixes a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants